Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic
Abstract
In this paper, we experiment with using Stagger, an open-source implementation of an Averaged Perceptron tagger, to tag Icelandic, a morphologically complex language. By adding language-specific linguistic features and using IceMorphy, an unknown word guesser, we obtain state-of-the-art tagging accuracy of 92.82%. Furthermore, by adding data from a morphological database, and word embeddings induced from an unannotated corpus, the accuracy increases to 93.84%. This is equivalent to an error reduction of 5.5%, compared to the previously best tagger for Icelandic, consisting of linguistic rules and a Hidden Markov Model.
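The accuracy and error-reduction figures in the abstract can be related with a little arithmetic. The sketch below assumes that "error reduction" means the usual relative reduction in error rate, (baseline error - new error) / baseline error; the abstract does not spell out the formula. The function names are hypothetical, and the baseline accuracy it prints is only a back-calculation from the 93.84% and 5.5% figures, not a number reported in the abstract.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Relative reduction in error rate; both accuracies are given in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


def implied_baseline_accuracy(new_acc: float, reduction: float) -> float:
    """Baseline accuracy (percent) implied by a new accuracy and a relative error reduction."""
    new_err = 100.0 - new_acc
    return 100.0 - new_err / (1.0 - reduction)


# Gain from adding the morphological database and word embeddings (92.82% -> 93.84%).
step = relative_error_reduction(92.82, 93.84)

# Accuracy of the earlier best tagger implied by the stated 5.5% reduction
# (an inference under the assumption above, not a figure from the abstract).
implied = implied_baseline_accuracy(93.84, 0.055)

print(f"relative error reduction, 92.82 -> 93.84: {step:.1%}")
print(f"implied accuracy of the previous best tagger: {implied:.2f}%")
```

Under this reading, the move from 92.82% to 93.84% removes roughly 14% of the remaining errors, while the 5.5% figure is measured against a different, stronger baseline.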
Similar resources
Tagging the Past: Experiments using the Saga Corpus
There is an increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. This is carried out in three main steps. First, we evaluate taggers, which were trained on Modern Icelandic, when tagg...
Tagging Icelandic Text using a Linguistic and a Statistical Tagger
We describe our linguistic rule-based tagger IceTagger, and compare its tagging accuracy to the TnT tagger, a state-of-the-art statistical tagger, when tagging Icelandic, a morphologically complex language. Evaluation shows that the average tagging accuracy is 91.54% and 90.44%, obtained by IceTagger and TnT, respectively. When tag profile gaps in the lexicon, used by the TnT tagger, are filled ...
Icelandic Data Driven Part of Speech Tagging
Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our sy...
Improving the PoS tagging accuracy of Icelandic text
Previous work on part-of-speech (PoS) tagging Icelandic has shown that the morphological complexity of the language poses considerable difficulties for PoS taggers. In this paper, we increase the tagging accuracy of Icelandic text by using two methods. First, we present a new tagger, by integrating an HMM tagger into a linguistic rule-based tagger. Our tagger obtains state-of-the-art tagging ac...
Further Results and Analysis of Icelandic Part of Speech Tagging
Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our sy...